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Abstract 

Practical application of genomic-based risk stratification to clinical diagnosis is appealing yet performance varies widely 
depending on the disease and genomic risk score (GRS) method. Celiac disease (CD), a common immune-mediated illness, is 
strongly genetically determined and requires specific HLA haplotypes. HLA testing can exclude diagnosis but has low 
specificity, providing little information suitable for clinical risk stratification. Using six European cohorts, we provide a proof- 
of-concept that statistical learning approaches which simultaneously model all SNPs can generate robust and highly 
accurate predictive models of CD based on genome-wide SNP profiles. The high predictive capacity replicated both in cross- 
validation within each cohort (AUC of 0.87-0.89) and in independent replication across cohorts (AUC of 0.86-0.9), despite 
differences in ethnicity. The models explained 30-35% of disease variance and up to ~43% of heritability. The GRS's utility 
was assessed in different clinically relevant settings. Comparable to HLA typing, the GRS can be used to identify individuals 
without CD with >99.6% negative predictive value however, unlike HLA typing, fine-scale stratification of individuals into 
categories of higher-risk for CD can identify those that would benefit from more invasive and costly definitive testing. The 
GRS is flexible and its performance can be adapted to the clinical situation by adjusting the threshold cut-off. Despite 
explaining a minority of disease heritability, our findings indicate a genomic risk score provides clinically relevant 
information to improve upon current diagnostic pathways for CD and support further studies evaluating the clinical utility 
of this approach in CD and other complex diseases. 
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Introduction 

Improving the diagnosis of celiac disease (CD), a common 
immune-mediated illness caused by dietary gluten, remains a 
clinical challenge [1,2]. Despite a prevalence of approximately 1% 
in most Western countries, lack of awareness and failure to 
implement appropriate serological, histological and genetic testing 
means that less than 30-40% of those affected by CD are diagnosed 
[1,3-5]. Undiagnosed CD is associated with reduced quality of life, 
substantial morbidity, and increased mortality, however, prompt 
diagnosis and treatment lowers the burden of disease and may 
reduce the rate of complications such as osteoporosis, autoimmune 
disease, and malignancy. Optimizing the diagnosis of CD is now 
recognized as an important goal for clinicians [6]. 

CD is characterized by a variable combination of gluten- 
dependent clinical manifestations, CD-specific antibodies and small 
bowel inflammation (villous atrophy) [7] . Traditional guidelines for 



the diagnosis of CD rely on demonstrating villous atrophy and 
improvement of symptoms, laboratory abnormalities, and/ or small 
bowel inflammation upon exclusion of dietary gluten [8]. Current 
clinical practice is to screen for CD by detecting CD-specific serum 
antibodies and then confirm the diagnosis by undertaking small 
bowel biopsy to demonstrate typical villous atrophy. Serologic 
screening for CD with transglutaminase-IgA antibodies is reported 
to be highly sensitive and specific for CD (both >90%), imparting a 
high positive predictive value (PFV) of over 90% when assessing 
most populations [9,10], although the PPV can fall to 45-70% in 
community screening settings [11,12]. In practice, serological and 
histological assessments have technical limitations that generate 
both false negative and false positive diagnoses. 

A key feature of CD is its strong dependence on the presence of 
susceptibility genes encoding for HLA DQ2.5, DQ8, and/or half 
the HLA DQ2.5 heterodimer (typically DQ2.2), seen in approx- 
imately 99.6% of all patients with CD [13]. These genes encode 
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Author Summary 

Celiac disease (CD) is a common immune-mediated illness, 
affecting approximately 1% of the population in Western 
countries but the diagnostic process remains sub-optimal. 
The development of CD is strongly dependent on specific 
human leukocyte antigen (HLA) genes, and HLA testing to 
identify CD susceptibility is now commonly undertaken in 
clinical practice. The clinical utility of HLA typing is to 
exclude CD when the CD susceptibility HLA types are 
absent, but notably, most people who possess HLA types 
imparting susceptibility for CD never develop CD. There- 
fore, while genetic testing in CD can overcome several 
limitations of the current diagnostic tools, the utility of 
HLA typing to identify those individuals at increased-risk of 
CD is limited. Using large datasets assaying single 
nucleotide polymorphisms (SNPs), we have developed 
genomic risk scores (GRS) based on multiple SNPs that can 
more accurately predict CD risk across several populations 
in "real world" clinical settings. The GRS can generate 
predictions that optimize CD risk stratification and 
diagnosis, potentially reducing the number of unnecessary 
follow-up investigations. The medical and economic 
impact of improving CD diagnosis is likely to be significant, 
and our findings support further studies into the role of 
personalized GRS's for other strongly heritable human 
diseases. 

immune-recognition molecules which facilitate CD4+ T cell 
recognition of specific gluten-derived peptides, a critical step in 
disease pathogenesis [14—18]. Recognizing the crucial role of these 
genes, the latest consensus diagnostic guidelines for CD recom- 
mend testing for these HLA heterodimers (HLA typing) as a first- 
line investigation for asymptomatic individuals identified at-risk of 
CD, such as T'-degree relatives of an affected individual or those 
with suggestive symptoms [7]. However, a major flaw of HLA 
typing as a diagnostic tool is that a substantial proportion of the 
community, typically reported to be 30-A0%, express HLA 
DQ2.5, DQ8, and/or DQ2.2, thus making the presence of these 
HLA types poorly predictive and of low specificity for CD [13]. 
Indeed, a recent Australian population study revealed that 56% of 
the community possessed at least one of these CD susceptibility 
haplotypes [5]. Thus, while HLA typing can exclude CD in the 
community with high confidence when the susceptibility haplo- 
types are absent, these haplotypes will be present in 30-56% of the 
population, the majority of whom would not have CD. Therefore, 
if assessed as a stand-alone test, HLA typing has exceptionally high 
sensitivity and negative predictive value (NPV) but very poor 
specificity and low positive predictive value (PFV) for CD. Since a 
positive result poorly predicts the presence of CD, HLA typing is 
not useful as a stand-alone diagnostic tool for CD. While the 
relative-risk for CD can be stratified based on the HLA subtype 
(CD risk DQ2.5>DQ8>DQ2.2) [19], these categories have low 
positive predictive value and do not provide clinically-informative 
attribution of CD risk [20]; HLA results are therefore interpreted 
as a binary outcome: CD susceptibility positive or negative. 
Despite these limitations, HLA typing is now widely utilized in 
clinical practice and typically determined using polymerase chain- 
reaction sequence specific oligonucleotide (PCR-SSO) hybridiza- 
tion, which is time and labor intensive, and cosdy (AU $120/ 
sample, Medicare; in the USA cost varies but is typically US $150/ 
sample or greater). 

It is important to distinguish between three different approaches 
to analyzing the HLA region for association with CD. The first 
approach, currently in clinical practice, is HLA typing, as 



described above, where the HLA result is considered a binary 
variable and its utility is to exclude CD. A second approach, such 
as that taken by Romanos et al., utilizes the same HIA-DQ 
haplotypes, stratifies individuals into several nominal risk levels 
then fits a statistical model to empirically estimate the true risk in 
each group [21,22]. While HIA-DQ haplotypes may be inferred 
from typing several HLA SNPs, importantly the HLA SNPs are 
only used to assign the HLA type and the SNPs themselves are not 
directly modeled. The third approach, such as that used here, is 
based on direct concurrent modeling of many thousands of 
individual SNPs for association with CD in order to produce a 
more fine-grained predictive "genomic risk score" (GRS). 

GRSs have been enabled by the advent of genome-wide 
association studies (GWAS), which perform unbiased testing of 
many thousands of SNPs for association with CD. Using GWAS, 
recent studies have identified multiple non-HIA SNP associations 
with CD [23,24]. GWAS are primarily concerned with the 
detection of variants associated with disease in order to gain insight 
into the disease etiology and genetic architecture. Due to the high 
number of significance tests, controlling for false positive 
associations is a major, valid concern. Therefore, SNP-based risk 
scores have tended to be constructed from the SNPs found to be 
significandy associated with the disease status [22,25]. However, 
due to the stringent multiple-testing corrections utilized in GWAS 
there may be other SNPs that fail to achieve genome-wide 
significance but may be predictive of disease status nonetheless and 
including them in the model could potentially result in higher 
predictive ability than achievable by models based solely on 
genome-wide significant SNPs. In contrast to the GWAS 
approach, the main overriding aim of a GRS from a clinical 
perspective is to achieve maximal predictive capacity, the 
inference of genetic architecture is secondary. 

We have recendy designed computational algorithms which 
efficiently fit LI -penalized multivariable classification models to 
genome-wide and whole-genome SNP data [26]. Such models 
were then shown to be preferable to several other methods such as 
the standard method of summing the per-SNP log odds (polygenic 
score) [27], mixed effects linear modeling [28,29], and unpena- 
lized logistic regression, with both better precision for detecting 
causal SNPs in simulation and better case/control predictive 
power [30]. These advantages were consistent across several 
complex diseases, including two British studies of CD. However, 
the diagnostic implications of penalized models have not been 
previously examined nor has the robustness of such models in 
other populations or the advantage over HLA-typing approaches. 
In contrast to existing studies that examine a small number of 
genome-wide significant SNPs, we have shown that many more 
SNPs (potentially hundreds) are required to achieve optimal 
predictive ability for CD. Further, the standard GWAS approach 
of considering each SNP separately when estimating its effect size 
does not consider its correlation with other SNPs. We have shown 
that unpenalized predictive models based on these top SNPs suffer 
from lower predictive ability than LI -penalized models since the 
pre-screening introduces multiple highly correlated SNPs into the 
model, of which a substantial proportion may be redundant in 
terms of contribution to the predictive ability. Similar Ll- 
penalized approaches have also recently been successfully applied 
to inflammatory bowel disease case/control Immunochip data, 
where models based on several hundred SNPs have led to high 
predictive ability [31]. 

Here, we provide a proof-of-concept that the GRS for CD, 
induced by LI -penalized support vector machine models, are able 
to achieve a predictive capacity and robustness that provides 
information not afforded by current diagnostic pathways utilizing 
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HLA typing alone. This GRS has the potential to provide greater 
clinical diagnostic utility by enabling each individual to be 
assigned a more informative risk score beyond the simple 
designation of "CD susceptible" or "CD non-susceptible", or 
"high risk" versus "low risk". To enable useful comparisons 
between diagnostic approaches, we model the GRS as a stand- 
alone test to "diagnose" CD, while at the same time acknowledg- 
ing that real world clinical practice will need to draw upon clinical 
history, CD-specific serology and small bowel histology to confirm 
the diagnosis of CD. We assess the predictive power of the GRS 
both in cross-validation and in external validation, across six 
different European cohorts, showing that the models strongly 
replicate. We test our GRS on three other autoimmune diseases: 
type 1 diabetes, Crohn's disease, and rheumatoid arthritis, finding 
some predictive ability for T ID status but none for the others, thus 
largely supporting the specificity of the scores for CD. To 
overcome limitations of previous studies utilizing GWAS case/ 
control studies, where ascertainment bias incurs substantially 
higher rates of false positive results, we undertake genomic 
prediction of CD in "real world" settings where the prevalence of 
CD is far lower and evaluate the performance of the GRS using 
PPV and NPV at several levels of CD prevalence. Unlike HLA 
typing, the GRS allows flexibility in determining who is considered 
at higher risk for CD by selecting a clinically determined user- 
specified threshold. We demonstrate how these scores can be 
practically applied at various prevalence levels to optimize 
sensitivity and precision. Finally, we show how the model can be 
calibrated to produce accurate predicted probabilities of disease. 

Results 

An overview of our analysis workflow is shown in Figure 1 . We 
analyzed five CD datasets on the Illumina Infinium array 
platform: UK1 and UK2 (British descent), Finn (Finnish descent), 
IT (Italian descent), and JVL (Dutch descent). We also utilized a 
dataset run on a fine-mapping array: IMM (Immunochip of British 
descent) (see Methods and Table 1). We have previously analyzed 
the AUC achievable in the UK1 and UK2 [30]. 

We trained Ll-penalized support vector machines (SVM) [26] 
on the genotype data, including all post quality control (QG) 
autosomal SNPs unless otherwise indicated. These models are 
sparse models, (Methods) and varying the penalty induces models 
based on different number of SNPs with non-zero coefficients. We 
investigated the performance for various degrees of sparsity, and 
the models were fitted to all SNPs across the genome simulta- 
neously. For each model's induced risk score, we estimated the 
area under the receiver operating characteristic curve (AUC) and 
the explained phenotypic variance [32], which depends on the 
assumed prevalence of disease. 

Cross-Validation in Each Dataset 

Within each dataset, we used 10x10 fold cross-validation to 
estimate the AUC and explained phenotypic variance on the 
liability scale. The explained phenotypic variance was derived 
from the AUC assuming a population prevalence of K= 1 % . All 
cohorts showed high AUC in cross-validation (Figure 2a), with the 
Finnish and Italian cohorts having a maximum of 0.89, followed 
by the UK1 cohort (AUC = 0.88), and finally the UK2 and Dutch 
cohorts with a maximum AUC of 0.87. Both the UK1 and the 
Italian cohorts peaked at ~64 SNPs with non-zero weights, 
whereas the rest peaked at ~250 SNPs. Subsampling of the 
individuals in the UK2 dataset indicated diminishing returns with 
80% of the sample size having the same AUC as 100% (Figure 
SI). Consistent with this, combining the UK1 and UK2 datasets 



did not increase AUC beyond UK2 alone (results not shown). It is 
also important to note that some of the control samples were 
population-based and were not explicidy screened for celiac 
disease, thus ~ 1 % may be cryptic CD cases which potentially 
underestimates prediction performance in downstream analyses. 
These AUCs correspond to explained phenotypic variance of 30- 
35% (Figure 2b). Assuming a CD heritability of 80%, this 
translated to an explained genetic variance of 37-43%. 

External Validation between Datasets 

While cross-validation provides an estimate of the model's 
ability to generalize to unseen datasets, choosing the model with 
the highest AUC may lead to so-called "optimization bias" (also 
called "winner's curse") [33,34], potentially manifesting as lower 
performance in independent validation. Additionally, cross-vali- 
dation cannot compensate for intra-dataset batch effects, as these 
would be present in both the training and testing folds, potentially 
artificially inflating the apparent predictive ability. To assess 
whether the models suffered from optimization bias and to control 
for the possibility of intra-dataset batch effects, we performed 
external validation. Based on the results of the cross-validation, we 
selected the best models trained on the UK2 dataset then, without 
any further tuning, tested them on the UK1, Finn, IT and NL 
datasets and computed the receiver-operating characteristic 
(ROC) curves (Figure 3a). Overall, the models trained on the 
UK2 cohort showed high reproducibility on the other cohorts, 
achieving AUC of 0.89-0.9 in the Finnish and UK1 datasets, 
indicating negligible optimization bias from the cross-validation 
procedure. We also examined the replication of different SNP sets 
(all autosomal, MHC, and non-MHC) trained on the UK2 dataset 
and tested on the others (Figure S2). The trends observed in cross- 
validation, namely similar performance for MHC and all 
autosomal SNPs and lower but still substantial performance of 
non-MHC SNPs, was observed in all external validation 
experiments. 

Comparison of Genomic Risk Score with Methods Based 
on HLA Typing 

Since HLA typing is commonly used for assessing CD risk 
status, we sought to compare the performance of an approach 
based on inferred HLA types with the GRS. We utilized the 
approach of Romanos et al. [21] on the Immunochip data, which 
relies on both HLA types and 57 non-HLA Immunochip SNPs 
(including one chrX SNP). Since directly measured HLA types 
were not available for our datasets, we imputed HLA-DQA1 and 
HLA-DQB1 haplotype alleles using HIBAG [35] and derived the 
presence of DQ2.2/DQ2.5-homozygous/DQ2.5-heterozygous/ 
DQ8 heterodimer status. The coefficients for the HLA risk types 
in the HLA+57 SNP model were not available for the Romanos et 
al. method, thus we had to estimate these from our data. For 
application of the GRS method, we trained models on 18,309 
autosomal SNPs from UK2 (the subset shared between the 
Illumina 670 and Immunochip) then externally validated these 
models on the Immunochip data. We trained three separate 
models: All autosomal SNPs, MHC SNPs, and autosomal non- 
MHC SNPs. As seen in Figure 3b, the GRS trained on either all 
SNPs or the MHC SNPs yielded higher AUC (0.87) than the 
Romanos HLA+57 SNPs (AUC = 0.85) or HLA type alone 
(AUC = 0.8). The predictive power of the GRS induced by SNPs 
outside the MHC was lower but still substantial at AUC = 0.72. 
We also performed similar analyses on the rest of the datasets 
(UK2 in cross-validation then externally validated on the rest), 
comparing the GRS with the HLA type and with analysis of HLA 
tag SNPs [36] commonly used to infer HLA types since the 57 
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Quality controlled datasets 





UK2 
1849 cases 
4936 controls 




UK1 

778 cases 
1422 controls 




Finn 

647 cases 
1829 controls 




IT 

497 cases 
543 controls 




NL 

803 cases 
846 controls 




IMM 

5907 cases 
4397 controls 







Predictive capacity (AUC) 
in cross-validation 



Select best 
UK2 model 



Combined 
Finn+NL+IT 



UK2GRS 



Predictive capacity (AUC) 
in independent validation 



Assessment of performance in realistic settings 



Prevalence independent 

• Sensitivity 

• Specificity 

• Risk score distribution 



Prevalence dependent 

• Positive predictive value 

• Negative predictive value 

• Odds of non-CD:CD 




Figure 1. The analysis workflow. 

doi:1 0.1 371 /journal.pgen.1 0041 37.g001 
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Table 1. List of celiac disease datasets used in this study. 





Name 


Ethnicity 


Platform 


Autosomal SNPs 
post-QC 


Male 


Female 


Cases 


Controls 


Total samples 
post-QC 


Finn 


Finnish 


lllumina 


513,952 


1206 


1270 


647 


1829 


2476 


IT 


Italian 


lllumina 


515,641 


332 


708 


497 


543 


1040 


NL 


Dutch 


lllumina 


515,169 


752 


897 


803 


846 


1649 


UK1 


British 


lllumina 


301,659 


938 


1262 


778 


1422 


2200 


UK2 


British 


lllumina 


515,444 


2954 


3831 


1849 


4936 


6785 


IMM 


British 


Immunochip 


18,252* 


3927 


6377 


5907 


4397 


10,304 



*only SNPs in common with the post-QC UK2 dataset were analysed and are thus shown here. 
doi:1 0.1 371 /journal.pgen.1 0041 37.t001 



non-HLA SNPs used by Romanos were not available on these 
platforms. As shown in Figure S2, the HLA type approach had 
consistently lower AUC (0.795-0.86) than analysis of the 
individual HLA tag-SNPs of Monsuur (AUC of 0.85-0.876, 
directly modeled using logistic regression on the SNPs in the UK2 
then tested on the other datasets) and substantially lower than the 
GRS (AUC of 0.86-0.894). 

Overall, our results showed that the LI -penalized SVM 
approach which modeled the SNPs directly was able to extract 
more information from the HLA region than the coarse-grained 
HLA haplotype model, either with or without the addition of the 
57 non-HLA SNPs. This resulted in a gain in explained 
phenotypic variance of 3.5% over the best Romanos et al model 
in the Immunochip data. 

Specificity of the Genomic Risk Score 

We investigated whether the models of CD were predictive of 
case/ control status in other immune-mediated diseases, specifically 
type 1 diabetes (T1D), rheumatoid arthritis (RA), and Crohn's 



Disease/Inflammatory Bowel Disease (Crohn's) from the 
WTCCC [37]. We utilized the 76,847 post-QC SNPs that 
appeared on both the UK2 lllumina and WTCCC Affymetrix 
500K arrays. Despite the substantial reduction in the number of 
SNPs from the original data, we observed only small reductions in 
AUC in the restricted UK2 dataset in cross-validation, indicating 
that most of the predictive information was retained in the reduced 
SNP set (AUC = 0.85 at -200 SNPs). The models trained on the 
UK2 were subsequently tested on the T1D, RA and Crohn's 
datasets. We also used the Finnish CD dataset as external 
validation to ensure that the high predictive performance observed 
in cross-validation on UK2 was replicated on other CD datasets 
and not degraded by using fewer SNPs. Overall, the models 
showed some predictive ability of T1D (AUC = 0.69), consistent 
with previous findings showing shared genetics between T1D and 
CD [38,39] (see Figure S3 for results for the MHC and non-MHC 
SNPs in T1D) but displayed very low performance (AUC 0.51- 
0.54) on the RA and Crohn's datasets. In contrast, performance on 
the Finnish CD cohort was only slightly lower (AUC = 0.85) 




1 4 16 64 256 1024 4096 1 4 16 64 256 1024 4096 

# of SNPs with non-zero weights # of SNPs with non-zero weights 



Figure 2. Building genomic models predictive of celiac disease. LOESS-smoothed (a) AUC and (b) phenotypic variance explained, from 10x10 
cross-validation, with differing model sizes, within each celiac dataset. The grey bands represent 95% confidence intervals about the mean LOESS 
smooth. 

doi:1 0.1 371 /journal.pgen.1 0041 37.g002 
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Figure 3. Performance of the genomic risk score in external validation, when compared to other approaches, and on other related 
diseases. ROC curves for models trained in the UK2 dataset and tested on (a) four other CD datasets, (b) the Immunochip CD dataset, comparing the 
GRS approach with that of Romanos et al. [21], and (c) three other autoimmune diseases (Crohn's disease, Rheumatoid Arthritis, and Type 1 Diabetes). 
We did not re-tune the models on the test data. For (b) and (c), we used a reduced set of SNPs for training, from the intersection of the UK2 SNPs with 
the Immunochip or WTCCC SNPs (1 8,252 SNPs and 76,847 SNPs, respectively). In (c), the same reduced set of SNPs was used for the CD-Finn dataset, 
in order to maintain the same SNPs across all target datasets. 
doi:1 0.1 371 /journal.pgen.1 0041 37.g003 



compared with the full SNP set, again confirming that the CD 
models replicated across ethnic cohorts despite using a reduced set 
of SNPs (Figure 3c). 

Analysis of a Combined Dataset 

All CD datasets showed consistently high AUC both in cross- 
validation and in external validation, indicating that the risk of 
substantial confounding of the case/control status by ethnic cohort 



(via population stratification or strong intra-cohort batch effects) 
was low. Therefore, in order to increase statistical power in 
comparing the performance of the models, we created a combined 
dataset consisting of the Finnish, Dutch, and Italian cohorts, 
totaling 5158 samples (1943 cases and 3215 controls, 512,634 
SNPs). This combined dataset is likely more representative of a 
real screening scenario where individuals of different ethnicities 
are being screened for CD. Figure 4a shows kernel density 
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Figure 4. Distribution of genomic risk scores in cases and controls, (a) Kernel density estimates of the risk scores predicted using models on 
UK2 and tested in the combined dataset Finn+NL+IT, for cases and controls, (b) Thresholds for risk scores in terms of population percent, with the top 
more likely to be a CD and the bottom more likely to be non-CD. 
doi:1 0.1 371 /journal.pgen.1 0041 37.g004 



estimates of the predicted risk scores for cases and controls in the 
combined dataset, where the scores are based on models trained 
on the UK2 dataset as previously described. As expected from the 
high AUC, there is substantial separation between the score 
distributions for the two classes. Also shown is the percentage of 
the combined population corresponding to a range of GRS 
thresholds (Figure 4b). 

Positive and Negative Predictive Values under Different 
Prevalence Settings 

The prevalence of CD in the general population, here taken to 
be 1%, is much lower than the prevalence in the case/control 
datasets where the cases are substantially over-represented owing 
to the study design. Considering the prevalence as the prior 
probability of a person having the disease (without knowing their 
genetic profile), then unless the likelihood of disease given the 
genotype is high as well, the posterior probability of disease will 
remain low. To quantify the predictive performance of our 
models while accounting for the prevalence, we estimated the 
precision of our models trained on UK2 on the Finn+NL+IT 
combined dataset. We down-sampled the cases in the combined 
dataset to simulate settings with different CD prevalence levels 
(1%, 3%, 10%, and 20%) and estimated precision and sensitivity 
in the test data, repeated in 50 independent simulations for each 
prevalence level (Figure 5a). The precision here is equivalent to 
the PPV [40] as the precision is estimated in data with the same 
prevalence as assumed by the PPV. The PPV is the posterior 
probability of having the disease given a positive diagnosis, and 
the NPV is the posterior probability of not having the disease 
given a negative diagnosis (a perfect model offers 
PPV = NPV = 1). Note that the lowest NPV achievable is 



1 —prevalence which translates to seemingly high NPV values 
in the low-prevalence setting, rendering NPV less useful for 
assessing classifiers in such settings as even a weak classifier can 
achieve apparently high NPV. 

Population screening for CD is not currently accepted practice. 
Most evidence supports an active case-finding strategy where patients 
with risk-factors for CD, and therefore higher pre-test probability of 
CD than the population-wide average, are identified by their 
primary practitioner and screened. For example, the prevalence of 
CD in patients with a first-degree relative with CD is 10% or higher 
[41,42], and the prevalence of CD in patients with T1D ranges from 
3-16% [43]. The increased CD prevalence in these groups of 
patients improves the apparent 'diagnostic' performance of the GRS. 
To examine the effect of prevalence on PPV, we first employed the 
GRS in a population-based setting (prevalence of 1 %) which resulted 
in a PPV of -18% at a threshold that identified 20% of the CD 
cases, but this dropped to ~3% at a threshold identifying 85% of the 
CD cases. In contrast, performance in more clinically relevant 
settings with higher CD prevalence was substantially better. For 
instance, the PPV increased from ~ 1 8% at 1 % prevalence to ~40% 
at 3% prevalence, and to ~70% at 10% prevalence, with the 
sensitivity setting at 20% (Figure 5a). At 10% prevalence, increasing 
the GRS sensitivity to 60% resulted in a PPV of 40%, and at a 
sensitivity of 80% the PPV was ~30%. There were small differences 
in the AUC between prevalence levels, on the order of 1-3%, 
however all settings maintained high AUC at >0.86 (Figure 5b). 
Since sensitivity and specificity are independent of prevalence, these 
differences are likely due to the small number of cases in the low- 
prevalence settings and stochastic variations in the data caused by 
randomly sampling cases from different ethnicities, as each ethnicity 
showed slightly different predictability of CD in external validation, 
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Figure 5. Performance at different prevalences and partial ROC curves, (a) Positive and negative predictive values and (b) partial ROC curves 
for models trained on UK2 using 228 SNPs in the model, and tested on the combined Finn+NL+IT dataset. K represents the prevalence of disease in 
the dataset and the curves are threshold-averaged over 50 replications. Note that precision is not a monotonic function of the risk score. Precision is 
equivalent to PPV here. A prevalence of —10% corresponds to prevalence in first-degree relatives of probands with CD [42]. 
doi:1 0.1 371 /journal.pgen.1 0041 37.g005 



together with clinical heterogeneity resulting from different numbers 
of cryptic cases in the controls of each cohort. 

Non-Disease Cases Implicated per True Disease Case 

Another way to quantify the usefulness of predictive models as 
diagnostic tools is to evaluate the number of subjects without CD 
that are incorrecdy identified as potential CD cases per each true 
CD diagnosis, and to do so at different levels of clinical risk 
(prevalence). This measure is equivalent to the posterior odds of 
not having CD given the genotypes (1 - PPV)/PPV, where a lower 
number is better (fewer incorrect cases implicated per true CD 
case). Figure 6 shows that at a sensitivity threshold to detect 20% 
of CD cases, the odds of incorrectly implicating CD were ~7: 1 at 
prevalence of 1%, but this decreased to ~1:2 and ~1:5 at a 
prevalence of 10% and 20%, respectively. Further, at 10% CD 
prevalence the odds of incorrectly implicating CD were less than 
1:1 while maintaining sensitivity of more than 30%, and for 20% 
CD prevalence up to 80% of true CD cases could be detected with 
such odds. 

Application of the Genomic Risk Score 

The application of the GRS is straightforward: once the SNPs in 
the model have been genotyped for a given patient, the GRS can 
be easily computed as the sum of the SNPs weights times the allele 
dosages plus an intercept term (Methods and Table SI). Our 
models consist of ~200 SNPs, hence the score can be easily 
computed in a spreadsheet or with PLINK. Whereas the models 
are fixed in the training phase, the interpretation of the scores 
depends on the screening setting in which they are used since 
selection of different risk thresholds leads to different false positive 
and false negative rates. In other words, the same numerical risk 
score may be interpreted differently in each setting, depending on 
the performance criteria required by the clinician, such as a 



minimum level of sensitivity or a maximum number of non-CD 
implicated per true CD implicated. 

Figure 7 illustrates how the GRS could be applied in two 
commonly encountered but different clinical settings to (i) exclude 
individuals at average (background) risk of CD with high 
confidence, or to (ii) stratify individuals at higher risk of CD for 
further confirmatory testing. In the first setting, in order to 
optimize the NPV, a suitably low GRS threshold is selected, 
leading to a relatively large proportion of the population being 
considered as potentially at-risk of CD. An NPV of 99.6% 
(comparable to HTA testing) can be achieved at the population- 
wide 1% prevalence by setting a threshold corresponding to 
designating 15% of the population as CD cases (PPV of 5%). In 
the second setting, we modeled a scenario where the risk of CD is 
increased (for instance in patients with suggestive symptoms or 
clinical conditions) and risk stratification is sought to identify the 
patients most likely to benefit from further definitive investigation 
for CD. The prevalence of CD in those with higher-risk 
symptoms is approximately 3% [3,44] and in first-degree relatives 
of CD patients it is 10% [41,42]. In this second setting, we 
highlight two extreme choices of threshold as an example of what 
is achievable using the GRS at each prevalence level. The first 
threshold is stringent, predicting only a small number of high- 
confidence individuals as likely to have CD and subsequently 
leading to low sensitivity but higher PPV. The second threshold is 
low, implicating a larger number of individuals as likely to have 
CD and leading to higher sensitivity at the expense of reduced 
PPV. 

More detailed results for a range of prevalence levels are 
shown in Table S2. These consider different cutoffs of the risk 
score expressed as a proportion of the population implicated 
for CD. We used the proportion of the population rather than 
proportion of the cases (sensitivity) to select risk thresholds 
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Figure 6. Clinical interpretation as a function of threshold and prevalence. The number of non-CD cases "misdiagnosed" (wrongly 
implicated by GRS) per true CD cases "diagnosed" (correctly implicated by GRS), for different levels of sensitivity. The risk score is based on a model 
trained on the UK2 dataset, and tested on the combined Finn+NL+IT dataset. The results were threshold-averaged over 50 independent replications. 
Note that the curve for K= 1% does not span the entire range due to averaging over a small number of cases in that dataset. 
doi:1 0.1 371 /journal.pgen.1 0041 37.g006 



since the true number of cases is unknown and we must select 
how to classify a given individual based only on their score 
relative to the population scores estimated in our data 
(Figure 4). As expected, sensitivity and specificity remain 
unchanged between the prevalence levels using the same risk 
score cutoff, however PPV, NPV and consequently the number 
of people incorrectly implicated as CD for each true CD case, 
depend strongly both on the prevalence and on the cutoff. 
Therefore, at a given prevalence level a suitable risk score 
cutoff can be selected in order to balance the two competing 
requirements of increasing the number of people correctly 
identified as having CD per true cases (PPV) while maintaining 
an acceptable level of sensitivity (coverage of the cases). A 



major benefit of the GRS is its flexibility in adapting to the 
appropriate clinical scenario and needs of the clinician. The 
PPV of the GRS can be adjusted up or down by varying the 
GRS cutoff and considering the acceptable level of sensitivity 
to detect CD. In practice, the most clinically appropriate cut- 
off thresholds would ideally be determined in local populations 
by undertaking prospective validation studies utilizing the 
GRS (See Discussion). 

Risk Score Calibration 

While the raw GRS cannot strictly be interpreted as the 
probability of disease given the genotypes, since it hasn't been 
normalized to be between 0 and 1, the score can be transformed 
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Figure 7. Example clinical scenarios. The GRS can be employed in different clinical scenarios and tuned to optimize outcomes. The GRS can be 

employed in a comparable manner to HLA testing (left table) to confidently exclude CD. In this scenario, we selected a GRS threshold based on 
NPV = 99.6% however a range of thresholds can be selected to achieve a high NPV (see note below). The GRS can also stratify CD risk (right table). 
Confirmatory testing (such as small bowel biopsy) would be reserved for those at high-risk. In this example, we present two scenarios: optimization of 
PPV or of sensitivity. In comparison to the GRS, all HLA-susceptible patients will need to undergo further confirmatory testing for CD. For more 
information on GRS performance across a range of thresholds, see Table S2. Prospective validation of the GRS in local populations would enable the 
most appropriate settings for NPV, PPV and sensitivity to be identified which provide the optimal diagnostic outcomes. + The highest achievable NPV 
at 10% prevalence was 99.4%. 
doi:1 0.1 371 /journal.pgen.1 0041 37.g007 



into a probability using the empirical distribution of scores in the 
data (Figure 4). To assess the agreement between the predicted 
probability of disease and the observed probability of disease, we 
used calibration plots [45] that compared the predicted 5% 
quantiles of the risk scores, derived from models trained on the 
UK2 dataset and externally applied to the other datasets, with the 
observed probability of cases in each bin. For a well-calibrated 
GRS, the proportion of cases to samples in each bin should be 
approximately equal to the predicted risk. To correct for potential 
lack of calibration, we fitted a LOESS smooth to the calibration 
curve, which was then used to adjust the raw predictions into 
calibrated predictions. To avoid biasing the calibration step and to 
assess how well it performed in independent data, we randomly 
split each external dataset (Finn, IT, NL, and UK1) into two 
halves of approximately equal size. We assessed calibration in the 
first half of each dataset and fitted a LOESS smooth to the 
calibration curve (Figures S4a and S4c). We then used the LOESS 
smooth to calibrate the predictions for the other half of each 



dataset and assessed the calibration there (Figures S4b and S4d). 
Since the calibration is affected by prevalence, we assessed this 
procedure both in the observed data (prevalence of ~40%) and in 
a subsampled version with prevalence of ~10%. Overall, our 
calibration procedure was able to correct for a substantial amount 
of mis-calibration in the raw scores, even in the more challenging 
case of 10% prevalence. 

Discussion 

In this study, we have sought to exploit the strong genetic basis 
for CD and leverage comprehensive genome-wide SNP profiles 
using statistical learning to improve risk stratification and the 
diagnosis of CD. Our models showed excellent performance in 
cross-validation and were highly replicable in external validation 
across datasets of different ethnicities, suggesting that the genetic 
component is shared between these European ethnicities and that 
our models were able to capture a substantial proportion of it. 
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Importantly, even without explaining a majority of CD heritabil- 
ity, the models were robust and accurate, showing that it is not 
necessary to explain most of the heritability in order to produce a 
useful model. 

The most frequently employed tools to diagnose CD are 
serology and small bowel histology, but both have limitations. 
Differences in the sensitivity of antibody recognition of commer- 
cially employed CD-specific antigens such as tissue transglutamin- 
ase, deamidated gliadin peptides, and endomysial antigen, as well 
as the human operator performing the assay can all influence 
findings and affect reproducibility of serological testing [9,46-49] . 
Serologic testing in children is reported to be less reliable before 
the age of 4 and up to 50% of children normalize elevated 
antibodies over time [50,51]. While small bowel histology remains 
the 'gold standard' confirmatory test, it is dependent upon patients 
willing and available to undergo endoscopy, adequate sampling by 
the gastroenterologist, and appropriate pathological processing 
and interpretation [52-54]. The frequencies of false positives and 
false negatives in CD serology assays vary widely and also partly 
depend upon what degree of histologic inflammation is considered 
compatible with CD [52,54-58]. Notably, the accuracy of both 
serologic and histologic testing for CD is dependent on the 
ongoing consumption of gluten. It is clear that clinically significant 
variability exists in serologic and histologic work-up for CD and 
new tools to improve the accuracy of CD diagnosis would be of 
benefit to clinicians. Given the strong genetic basis for CD, 
genomic tools are logical and appealing because they are relatively 
robust and less subject to the kind of variability seen with serologic 
and histologic assessment, are independent of age, and do not rely 
on dietary intake of gluten. 

A major shortcoming of clinical HLA typing for risk prediction 
of CD is its poor specificity. HLA testing would result in virtually 
all CD cases detected but at the cost of approximately 30-56 
people incorrectly implicated for each true case of CD. A 
significant advantage of the GRS approach is that it can be 
adapted to the clinical scenario in order to maximize PPV and 
diagnostic accuracy. By promoting accurate clinical stratification, 
the GRS could reserve invasive and more expensive confirmatory 
testing for those who would most likely benefit from further 
investigation to secure a diagnosis, and it would avoid unnecessary 
procedures in those who are HLA susceptible but unlikely to have 
CD. This provides both clinical and economic benefits. HLA 
typing does not provide the flexibility afforded by the GRS and 
cannot be effectively employed to identify those who would benefit 
from endoscopy. For instance, if HLA typing were used as a guide 
for further investigations, at 10% CD prevalence it would generate 
over five unnecessary endoscopies per correct endoscopy and at 
1% CD prevalence it would generate 30-56 unnecessary 
endoscopies. Small bowel endoscopy is not a trivial undertaking 
- the procedure is costly (approximately AUD $750-$ 1000 for the 
procedure and associated pathology), has potential complications, 
necessitates a full day off work, and many patients are reluctant to 
undergo it. 

The GRS can be used to exclude patients unlikely to have 
CD with a performance comparable to HLA typing. Testing 
with these parameters may be useful in the clinical scenario of 
assessing individuals at average risk of CD. A common example 
would be when a person has commenced a gluten-free diet prior 
to assessment for CD by serology or small bowel examination 
and are unwilling or unable to resume oral gluten intake in 
order to make testing reliable. This is an increasingly common 
clinical dilemma as the number of people following a gluten- 
free diet without adequate initial testing for CD continues to 
rise. In the United States approximately 30% of the adult 



population are interested in cutting back or avoiding dietary 
gluten [59]. 

The GRS can also be used to stratify the risk for CD in 
patients who present with suggestive clinical features. These risk 
factors include having a first-degree relative with CD or 
problems such as recurrent abdominal pain, bloating, diarrhea 
or constipation, fatigue, weight loss, unexplained anemia, 
autoimmune disease (including thyroid disease, T1D, autoim- 
mune hepatitis, rheumatoid arthritis, and Sjogren's syndrome), 
infertility or early-onset osteoporosis [3,60]. Supporting the 
recently revised diagnostic guidelines for CD, which promote 
HLA testing as the 1 st line investigation for higher-risk cases, 
genetic testing of CD is likely to be more informative in these 
sub-populations exhibiting higher-than-normal prevalence. 
While clinical guidelines recommend screening for CD in these 
high-risk populations [61], testing often poses a diagnostic 
dilemma as serologic assessment alone cannot confidently 
exclude a diagnosis, especially given the higher pre-test 
probability. HLA typing is not particularly informative as the 
CD HLA susceptibility haplotypes HLA-DQ2.5 and DQ8 are 
commonly present (manifesting in over 90% of patients with 
T1D and in 65% in first-degree relatives of individuals with CD) 
[62,63]. Stratifying these higher-risk patients based on a GRS 
will allow improved identification of those where small bowel 
biopsy is likely to be informative. Thus, a GRS should reduce 
the number of unnecessary small bowel biopsies in first-degree 
relatives who carry HLA susceptibility for CD but do not have 
it. We have found that our CD models had only moderate 
predictive ability for T1D, which is consistent with previous 
findings showing some shared genetics between T1D and CD 
[38]. Despite the substantial overlap of genetic factors for 
autoimmune disease, the CD models had negligible predictive 
ability for Crohn's disease and rheumatoid arthritis. These 
results indicate that our GRS is specific to CD and less likely to 
incorrectly identify patients with other autoimmune diseases as 
having CD, but further work is required to determine whether 
CD can be as confidently predicted in individuals with T ID as it 
is in non-TID populations. 

Another major clinical challenge that may benefit from 
genomic risk prediction is determining the natural history of 
potential CD (formerly termed 'latent CD') when there is 
serologic but not histologic evidence of CD, and identifying 
which patients are more likely to develop overt CD with small 
bowel inflammation [64]. Current practice is to follow-up all 
patients with immunologic evidence of gluten intolerance in order 
to capture those who will eventually develop overt disease. An 
analogous clinical scenario is that of children with positive CD 
serology, of whom 50% will fail to develop small bowel changes 
consistent with CD during follow-up [50,51]. In both clinical 
situations, it is reasonable to expect that a GRS can improve risk 
stratification of such patients for developing overt CD. Of course, 
environmental factors are important in the development of CD 
and the exact extent to which environmental versus genetic 
factors contribute to the development of overt CD remains 
unknown. Long-term follow-up studies of patients with potential 
CD will be necessary to establish the role of genomic risk 
prediction in this important subgroup. 

Future work will look at optimizing our GRS as a tool to predict 
CD risk. Validation of our model in real-life practice will be 
important to confirm the clinical benefit of the GRS in 
conjunction with serology and/ or over HLA typing alone, as well 
as to what extent other clinical predictors such as sex, age, and 
family history can contribute to clinically relevant risk prediction. 
Future prospective studies will enable direct optimization of 
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clinical utility (accuracy, practicality, throughput and cost) 
afforded by the GRS, for example in conjunction with CD 
serology. These studies will also provide a rigorous evidence base 
for suggested clinical guidelines of GRS usage. Importantly, 
appropriate GRS cut-off levels to maximize diagnostic accuracy 
(optimal PPV and NPV for each given clinical scenario and CD 
prevalence) could be obtained by local prospective validation. 
Such studies can identify the ultimate clinical role for the GRS: 
whether it can effectively replace HLA typing and also whether it 
is a stand-alone test or one to accompany CD serology. Hadithi et 
al showed that in patients at high-risk of CD the addition of HLA 
typing to CD serology had the same performance as either testing 
strategy alone [65], but the greater precision of the GRS over 
HLA typing may better complement CD serology. Understanding 
where the GRS fits in the diagnostic algorithm to optimize 
precision and cost-effectiveness will be essential, as is the role it 
might play in the diagnostic work-up of CD in populations with 
lower levels of clinical risk. Health economic modeling will address 
the cost-benefits of using the GRS in the diagnosis of CD, taking 
into account the cheaper cost of GRS over HLA typing, and 
include the downstream benefits of potentially reducing endosco- 
pies (substantial cost savings and value to patients from reduced 
discomfort) as well as potential improvements in quality of life 
from the detection of CD. 

Further, it may be that other statistical modeling approaches 
yield improvements in predictive power, for example non-additive 
models that consider epistatic interactions between SNPs. Another 
avenue for improvement is considering each CD subtype 
separately, recognizing potentially different genetic bases for these 
conditions. Based on our results, we do not expect substantial 
improvements from increasing sample size alone, however this will 
be important for adequately powered studies of lower frequency 
genetic variants of assumedly greater effect size. 

In summary, this study demonstrates that simultaneous 
modeling of all SNPs using statistical learning was able to 
generate genomic risk scores that accurately predict CD to a 
clinically relevant degree. This was despite the models 
explaining only a minority of disease heritability. The GRS 
better enables clinicians to stratify patients according to their 
risk of CD compared to HLA typing alone and, we predict, 
more accurately determines those suitable for confirmatory 
testing in the form of small bowel biopsy. Reserving this 
invasive, time consuming and costly procedure for higher-risk 
cases is likely to improve the accuracy, cost and public 
acceptance of testing for CD, and by extension, benefit the 
overall diagnosis of CD in the community. By better prioritizing 
higher-risk patients for confirmatory testing, genomic risk 
prediction carries promise as a clinically useful tool to add to 
the clinician's diagnostic armamentarium. Ultimately, we 
envisage a clinical scoring algorithm based on the combination 
of clinical features, serologic, and genetic information that will 
accurately predict people with biopsy-confirmed CD and 
perhaps ultimately overcome the reliance on small bowel 
histology altogether. Further, the costs of genotyping a select 
number of marker SNPs with a low-plex, high throughput 
technology are already far lower than the costs of full HLA 
typing, resulting in a test that is cheaper, more flexible and 
more precise than HLA typing. More generally, this study 
demonstrates that statistical learning approaches utilizing SNPs 
can already produce useful predictive models of a complex 
human disease using existing genotyping platforms assaying 
common SNPs and suggests that similar approaches may yield 
comparable results in other complex human diseases with 
strong genetic components. 



Methods 

Ethics Statement 

All participants gave informed consent and the study protocols 
were approved by the relevant institutional or national ethics 
committees. Details given in references van Heel et al [23] and 
DuBois et al [24]. All data was analysed anonymously. 

Data 

We analyzed six CD datasets: UK1 [23], UK2, IT, ML, and Finn 
[24], and IMM [66]. The main characteristics of the datasets are 
listed in Table 1. In addition we used three WTCCC datasets 
(T1D, Crohn's, and RA) that have been described elsewhere 
[30,37]. UK1 used the Illumina Hap330vl-1 array for cases and 
Hap550-2v3 for controls, UK2 used the Illumina 670-QuadCus- 
tom-vl for cases and 1.2M-DuoCustom-vl for controls, the NL 
and IT datasets used the Illumina 670-QuadCustom-vl in both 
cases and controls, and the Finn dataset used the Illumina 670- 
QuadCustom-vl for cases and Illumina 610-Quad for controls. 
The WTCCC data (T1D, Crohn's, and RA) used the Affymetrix 
500K array. In all of our models, we used autosomal SNPs only, 
and did not include the gender as a covariable, as models built 
separately on the two genders using the same sample size and 
casexontrol balance showed very similar performance in cross- 
validation on the UK2 dataset (results not shown). For analyses of 
the MHC region, we defined the MHC as all SNPs on chr6 in the 
range 29.7 Mb-33.3 Mb. 

Quality Control 

For each of the UK1, UK2, IT, NL, and Finn datasets, we 
removed non-autosomal SNPs, SNPs with MAF<1%, with 
missingness >1%, and those with deviations from Hardy- 
Weinberg Equilibrium in controls P<5xl0 6 . We also removed 
samples with missingness >1%. We tested identity-by-descent 
between samples in UK 1 and UK2 and removed one of a pair of 
samples with pi-hat SO. 05 (either between the datasets or within 
the datasets). The QC for the IMM Immunochip data has been 
previously described [66]; we estimated 5763 Immunochip 
samples to have pi-hat >0.125 (PLINK IBS) with any UK2 
sample, and those were removed, leaving 10,304 Immunochip 
samples in total, with 18,252 SNPs shared with the UK2 dataset 
(post-QC). The QC for the WTCCC data (T1D, Crohn's, and 
RA) has been previously described [30,37]. 

Assessment of Population Structure Effects 

To assess the impact of potential cryptic population structure, 
we estimated the top 10 principal components (PCs) for the UK2 
with EIGENSOFT 4.2 [67], after removal of regions with high LD 
(see Text SI for details). The principal components themselves 
showed almost no predictive ability (AUC = 0.52), and models 
trained on all SNPs accounting for these PCs showed indistin- 
guishable performance from the non-adjusted model, both in 
cross-validation on the UK2 dataset and in external validation on 
the Finn, NL, and IT datasets (Figure S5), demonstrating that 
confounding of our UK2 models by population structure was 
negligible and was not a contributing factor to the high predictive 
ability. 

Statistical Analysis 

We used LI -penalized support vector machines (SVM) 
implemented in the tool SparSNP [26] (https://github.com/ 
gabraham/SparSNP) as the classifiers. The LI -penalized SVM 
is a sparse linear model, that is, many or most of the SNPs will 
receive zero weight in the model, as determined by the LI 
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penalty. The use of a sparse model fits with our prior 
expectation that in autoimmune disease most SNPs will not be 
associated with disease status. The inherent sparsity of the 
model obviates the need for subsequent filtering of SNPs by 
weight, in order to decide which ones show strong evidence of 
association and which are spurious, as would be required in a 
non-sparse (L2 -penalized) model. In addition, in extensive 
simulation and in analysis of real genotype data, including the 
two celiac disease datasets UK1 and UK2, we have previously 
shown the advantage of LI -penalized SVMs over commonly 
used approaches such as polygenic scores (sum of the log odds), 
linear mixed models (GCTA), and unpenalized logistic regres- 
sion [30]. The advantage of sparse models over standard linear 
mixed models in predicting autoimmune disease has been 
recently confirmed in type-1 diabetes as well [68]. We have also 
shown that our LI -penalized SVMs achieved essentially 
identical performance to LI -penalized logistic regression 
(glmnet) in cross-validation over the Finnish subset of the celiac 
disease dataset, while being substantially faster [26]. Unlike 
single marker approaches that estimate the effect size of each 
SNP separately, the LI -penalized SVM is a multivariable 
model, where the estimated effect of each SNP is conditional on 
all other SNPs, thereby implicitly accounting for the linkage 
disequilibrium (LD) between SNPs. Besides imposing sparsity, 
the LI penalty tends to produce models where one represen- 
tative SNP is selected out of a group of highly correlated SNPs, 
while the rest remain with a zero weight, in contrast with L2- 
penalized or unpenalized models where many or all of these 
SNPs may receive a non-zero weight. For an in-depth discussion 
of these issues and the effects of varying LD levels on the 
performance of multivariable models, see [30]. 

The LI -penalized SVM model is induced by minimizing the 
LI -penalized squared-hinge loss over N samples and p SNPs, 

L(M) = 2^f>ax{0,l -M^ + A>)} 2 + ^E W 

!=1 j=l 

where x, is the /(-vector of genotypes for the z'fh sample in allele- 
dosage coding {0, 1 , 2 } , y are the binary phenotypes { — 1 , + 1 } , 
P is the /^-vector of weights, P 0 K the intercept (also called the 
bias, which is not penalized), and A is the LI penalty. We also 
investigated adding an L2 penalty to the model (elastic-net), 
however, based on initial cross-validation experiments, we 
found no advantage in the L2 penalty and subsequently did 
not use it. All of our models were additive in the allele dosage 
{0, 1, 2}. 

The genomic risk score j>, for a new sample x t consisting of/) 
genotypes is then 

p 

h = Po+ ^2 x ijPj 
7=1 

where the continuous value J, is later thresholded at different 
values to produce a binary predicted class. The model was 
evaluated over a grid of penalties, in 10-fold cross-validation, 
repeated 10 times. The optimal number of SNPs in the model was 
decided based on the model with the highest average AUC across 
the replications. The final model was a consensus model, averaged 
over all 10x10= 100 models, and containing approximately the 
number of SNPs determined earlier. Post processing and plotting 
of the results was performed in R [69] , together with the package 
ggplot2 [70]. 



Measures of Predictive Performance 

To quantify the predictive performance of the models in cross- 
validation and external validation, we employed receiver operating 
characteristic (ROC) curves (sensitivity versus 1 minus specificity), 
the area under the ROC curve (AUC) [71], and the proportion of 
phenotypic variance explained [32]. 

To quantify predictive performance in different population 
settings, we used the positive and negative predictive values, which 
can be estimated as 

ppv= sens x pre v 

sens x prev + (l —spec) x (1 —prev) 

and 

Npv= .spec x (1 -prev) 

spec x (1 — prev) + (1 — sens) x prev 

where "sens" is the sensitivity = TP/(TP+FN), "spec" is the 
specificity = TN/ (FP+TN), and "prev" is the population preva- 
lence. The PPV/NPV are equivalent to the posterior probability 
of a person having/not having the disease given a positive/ 
negative diagnosis, respectively. When the PPV and precision are 
estimated in data with identical prevalence (that is, the observed 
prevalence in the data is identical to the prevalence in the 
population for which we wish to estimate PPV), they are 
equivalent. Precision is defined as TP/ (TP+FP). 

Supporting Information 

Figure SI LOESS-smoothed AUC in lOxlO-fold cross-valida- 
tion for the random subsamples of the UK2 dataset, in increasing 
sample size proportions of the original data (n = 6785). 
(EPS) 

Figure S2 Results of externally validating the predictive 
models, trained on UK2 in cross-validation, and tested on the 
other CD datasets. Legend: Romanos HLA: 3-levels of risk (low, 
medium, high) [21] based on imputed HLA type (HIBAG); 
Romanos HLA+57 SNPs (Immunochip only): 3-level HLA risk plus 
57 Immunochip non-HLA SNPs [21]; Monsuur HLA SNPs: logistic 
regression on individual HLA SNPs [36] (5/6 SNPs or proxies 
thereof were found in the UK2/Finn/NL/IT datasets, 3/6 were 
found in the subset of UK1 shared with UK2); GRS MHC SNPs: 
SparSNP run on individual SNPs on chr6 within 29.7 
Mb-33.3 Mb; GRS non-MHC SNPs: SparSNP run on individual 
autosomal SNPs outside MHC; GRS all SNPs: SparSNP run on all 
autosomal SNPs. 
(EPS) 

Figure S3 ROC curves for CD model trained on SNP subsets of 
the UK2 dataset that were assayed for the WTCCC-T1D dataset: 
All SNPs (76,847 SNPS), MHC SNPs (186 SNPs in the MHC region 
of chr6, 29.7-33.3 Mb), and Non-MHC SNPs (76,661 SNPs outside 
the MHC). 
(EPS) 

Figure S4 Calibration plots, comparing predicted score in 5% 
quantiles against observed proportions of cases falling within the 
bin. The score comes from models trained on the UK2 dataset, 
and tested on the rest of the datasets. The bars show 95% 
confidence intervals using the Agresti-Coull method for 
proportions. We randomly split the test datasets into two halves. 
In the first half, we plotted the original quantiles of the scores 
and fitted a LOESS smooth to them. We did this for the original 
case/control data (prevalence of 40%), shown in (a), and for a 
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subsampled version of the data with prevalence of 10% (c). We 
then used the LOESS smooth to correct the original quantiles, 
forming a calibrated score, one for each dataset (Finn, IT, NL, 
UK1), which was then applied to the second half of the data, 
shown in (b) and (d) for prevalence of 40% and 10% 
respectively. The second half of the data was not used in the 
calibration step. 
(EPS) 

Figure S5 (a) LOESS-smoothed AUC from 10-fold cross- 
validation for the UK2 model (all autosomal SNPs), accounting 
for the top 10 PCs (included in training but not in testing), (b) 
External validation of the best UK2 model that accounted for the 
PCs (PCs excluded from testing). 
(EPS) 

Table SI The predictive model. The SNPs are sorted in 
decreasing order of the absolute value of their model weight 
averaged over the 10x10 cross-validation folds. Stability is the 
percentage of times a SNP was selected to have non-zero weight 
over the lOxlO-cross-validation folds. Intercept: —0.757226. To 
annotate the SNPs we used Bioconductor 2.12 together with the 
packages VariantAnnotation 1.6.5 and TxDb.Hsapien- 
s.UCSC.hgl8.knownGene 2.9.0. We considered a SNP to be 
genie if it was annotated to fall inside one of the regions 
{spliceSite, intron, fiveUTR, threeUTR, coding, promoter} and 
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